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Abstract: Our concern is selecting the concentration matrix's nonzero 
coefficients for a sparse Gaussian graphical model in a high-dimensional 
setting. This corresponds to estimating the graph of conditional depen- 
dencies between the variables. We describe a novel framework taking into 
account a latent structure on the concentration matrix. This latent struc- 
ture is used to drive a penalty matrix and thus to recover a graphical model 
with a constrained topology. Our method uses an i\ penalized likelihood 
criterion. Inference of the graph of conditional dependencies between the 
variates and of the hidden variables is performed simultaneously in an iter- 
ative EM-like algorithm. The performances of our method is illustrated on 
synthetic as well as real data, the latter concerning breast cancer. 

AMS 2000 subject classifications: Primary 62H20, 62J07; secondary 
62H30. 

Keywords and phrases: Gaussian graphical model, Mixture model, l\- 
penalization, Model selection, Variational inference, EM algorithm. 

Contents 



1 Introduction 2 

2 A latent structure model for network inference 7 

2.1 Gaussian graphical models: general settings 7 

2.2 Providing the network with a latent structure 7 

2.3 The complete likelihood 8 

3 Inference strategy by alternate optimization 10 

3.1 Variational estimation of the latent structure 10 

3.2 A Lasso-like method to estimate the concentration matrix .... 13 

3.3 Choice of penalty parameters 17 

4 Link with Meinshausen and Biihlmann's approach 19 

5 Numerical experiments 21 

5.1 Synthetic data 21 

5.2 Breast Cancer data 23 

References 27 

A Appendix section 29 

A.l Proof of the equivalence between the constraints 29 

A. 2 Fixed-point study 29 

A. 3 Proof of Lemma 2 (LASSO with pathwise coordinate optimization) 33 

A. 4 Reconstruction of the concentration matrix 34 

A. 5 Pseudo-likelihood of a Gaussian vector 34 

A. 6 Penalization upper bound 35 



1 



C. Ambroise, J. Chiquet and C. Matias/ Inferring sparse GGM with latent structure 2 



1. Introduction 

Estimating the concentration matrix (namely the inverse of the covariance ma- 
trix) of a Gaussian vector in a sparse, high-dimensional setting has received 
much attention recently. Graphical models provide a convenient setting for mod- 
elling multivariate dependence patterns. In this framework, an undirected graph 
is matched to the Gaussian random vector, where each vertex corresponds to 
one coordinate of the vector, and an edge is not present between two vertices 
if the corresponding random variables are independent, conditional on the re- 
maining variables. Now, conditional independence between two coordinates of 
the Gaussian random vector corresponds exactly to a zero entry in the concen- 
tration matrix. Thus, detecting nonzero elements in the concentration matrix 
is equivalent to reconstructing the Gaussian graphical model (GGM, see e.g. 
Lauritzcn 1996). 

We focus here on the crucial problem of selecting the concentration matrix's 
nonzero coefficients. In other words, we focus on variable selection rather than 
estimation. Application areas include gene-regulation graph inference in Biol- 
ogy (using gene expression microarray data), as well as spectroscopy, climate 
studies, functional magnetic resonance imaging, etc. We provide a very novel 
approach driving the graph selection according to an unobserved modular struc- 
ture on the vertices. 

The idea of covariance selection first appeared in the work of Dempster 
(1972). In the so-called "large p, small n" setting (namely when the number 
of observations is smaller than the dimension of the observed response), the 
need for covariance selection is huge, as the empirical covariance matrix is no 
longer regular. 

In Drton and Perlman (2007), a classification of the different methods for 
model selection/estimation in GGMs into three group types is suggested: constr- 
aint-based methods, performing statistical tests; Bayesian approaches; and score- 
based methods, maximizing a model-based criterion. The multiple testing prob- 
lem has been taken into account in Drton and Perlman (2007; 2008). The authors 
perform GGM covariance selection by multiple testing of hypotheses about van- 
ishing partial correlation coefficients. Such procedures may also be implemented 
using the PC-algorithm (Kalisch and Biihlmann 2007). Starting from a com- 
plete, undirected graph, the PC-algorithm deletes edges recursively, according 
to conditional independence decisions. However, the statistical procedure in Dr- 
ton and Perlman (2007) relies on asymptotic considerations, a regime never 
attained in real situations. 

Another attempt in this vein is to consider limited-order partial correlations 
(Wille and Biihlmann 2006, Castclo and Rovcrato 2006). In Wille and Biihlmann 
(2006) the authors consider only zero and first-order conditional dependencies. 
They argue that for sparse graphical models, these low-order dependencies still 
reflect reasonably well the full-order conditional dependency structure. More- 
over, these dependencies may be well estimated even with a small number of 
observations. In Castelo and Roverato (2006), the authors introduce a non- 
rejection rate to reduce the multiple testing and computational problems to 
which these approaches give rise. 

A Bayesian framework is proposed in Dobra et al. (2004), Jones et al. (2005) 
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and their method was applied for evaluating patterns of association in large-scale 
gene expression data. The approach is based on dependency networks, namely 
a collection of conditional distributions {P(Xj|X\ i )} (where X\* stands for the 
set of all variables but Xj). However, such conditional distributions will not in 
general result in a coherent joint distribution. This point is further discussed 
below concerning a similar problem appearing in Meinshausen and Buhlmann 
(2006). Moreover, constructing priors on the set of concentration matrices is not 
a trivial task (it mainly relies on Wishart priors) . The use of MCMC procedures 
limits the range of applications to moderate-sized networks. 

Before focusing on score-based methods, let us first introduce regularization 
procedures. In the context of linear regression, the LASSO (least absolute shrink- 
age and selection operator) technique was introduced by Tibshirani (1996). This 
procedure performs model selection and parameter estimation at the same time. 
The idea is that ordinary least-squares criterion may be improved in a sparse 
context, using an £i-norm penalty. The ^i-norm penalty shrinks the estimates 
to zero while preserving the convexity of the optimization problem. Note that 
the ^-norm penalization is also known as 'basis pursuit' in signal processing 
(Chen ct al. 2001). 

It is well known that if the ultimate goal is parameter estimation, model 
selection and estimation should be done in a single step. Performing the model 
selection prior to parameter estimation in the selected model will, in fact, result 
in a non-robust procedure. However, our primary focus here is on model selec- 
tion, as we want to infer sparse networks. We therefore concentrate on model 
selection rather than on estimation performances. 

The Lars algorithm (Efron et al. 2004) is one of the most popular techniques 
for solving the LASSO problem. It gives the path of solutions obtained when 
varying the penalty parameter (the penalty parameter is used as a scaling factor 
of the ^-norm penalty). The larger the penalty parameter, the sparser the 
Lasso solution. 

Using convex optimization techniques (see for instance Minoux 1986), the 
LASSO problem may be stated as a primal problem, whose dual formulation 
may be solved more easily. This approach is taken in Osborne ct al. (2000b). 
The authors obtain an iterative algorithm, the "homotopy method" (Osborne 
et al. 2000a). Other very efficient approaches are based on focusing on each coor- 
dinate iteratively. Indeed, for each coordinate, the LASSO problem is solved very 
simply (assuming the other coordinates are fixed) by soft-thresholding (Donoho 
and Johnstone 1995). Thus, different 'coordinate optimization' procedures have 
been proposed in the literature. Following the work of Fu (1998), a cyclic pro- 
cedure is proposed in Friedman et al. (2007), where optimization with respect 
to each coordinate is done iteratively; whereas Wu and Lange (2008) propose a 
greedy approach, computing the solution for each coordinate and choosing that 
which provides the largest decrease in a surrogate objective function. Note that 
these approaches rely on the underlying assumption that the predictors for the 
regression problem are uncorrelated. 

Let us now come back to covariance (or concentration) matrix inference in 
GGMs, using maximization of a model-based criterion. 

Meinshausen and Buhlmann (2006) were the first authors to apply LASSO 
techniques for inferring a covariance matrix in a GGM. Their approach is to solve 
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p different LASSO regression problems, where p is the dimension of the observed 
vector. The main drawback of such a procedure is that a symmetrization step is 
required to obtain the final network, ft might, for instance, be the case that the 
estimator of the regression coefficient for Xi on Xj is zero, whereas the estimator 
for Xj on Xi is not zero. Mcinshausen and Biihlmann propose to use either an 
"AND" or an "OR" final step procedure to recover an undirected correlation 
graph. However, these two procedures might result in different estimates and 
there is no way of choosing between them. Moreover, as previously stated, a set 
of conditional distributions does not necessarily cohere into a joint distribution. 
Using a set of possibly non-coherent conditional distributions corresponds to 
a pseudo- likelihood approach. This aspect was not underlined in Mcinshausen 
and Biihlmann (2006), and we clarify this point in Section 4. 

Subsequently, two other articles, Banerjee et al. (2008) and Yuan and Lin 
(2007) independently provided an improvement of the initial work of Mein- 
shausen and Biihlmann (2006). In both works, the problem is seen as a penalized 
maximum-likelihood (PML) problem. Instead of considering p different regres- 
sion problems, these two articles focus on the likelihood of the Gaussian vector, 
penalizing the entries of the concentration matrix with an ^j-norm penalty. 
They explain how the PML estimation may be solved as a "LASSO-like" prob- 
lem. The major issue with PML strategies in the context of the concentration 
matrix estimation of GGMs is to obtain a positive definite estimate. However, 
the approach for solving the problem in Yuan and Lin (2007) is not suited to 
high-dimensional settings, in contrast to the approach proposed in Banerjee 
et al. (2008). In Yuan and Lin (2007), a non-negative garrote-type estimator is 
used, and asymptotic properties (as n tends to infinity while p is held fixed) 
are given. In Banerjee et al. (2008), two different algorithms are proposed for 
solving problems in a high-dimensional setting. The first approach relies on a 
block-coordinate descent algorithm. The second is a semi-definite programming 
algorithm, based on Nesterov's method, which is computationally intensive. 

The next improvement in this vein comes with Friedman et al. (2008). Re- 
lying on coordinate descent techniques, previously described in Friedman et al. 
(2007), the authors revisit Banerjee et al.'s first approach and propose an effi- 
cient algorithm to solve the PML estimation problem, under the positive definite 
constraint. In fact, they use the block coordinate descent approach proposed by 
Banerjee et al. (2008) and combine it with a second coordinate descent method. 
Our method will make use of this approach. 

To conclude this part, we remark that a completely different shrinkage esti- 
mate was proposed by Schafer and Strimmer (2005) in the same context of large- 
scale covariance matrix estimation. The approach consists in using a weighted 
average of two different estimators, the first being unconstrained (thus having 
small bias but large variance), the second being low-dimensional (and thus ex- 
hibiting small variance but large bias). 

Now let us motivate the use of hidden structures in networks. Modularity 
is a property observed in real (biological) networks (see for instance Ihmels 
et al. 2002). Heterogeneity in the node behaviors is an important property of 
these data. For example, so-called 'hubs' are highly-connected nodes, showing 
a different behavior from the rest of the graph nodes. An interesting model 
capturing these network features is a mixture model for random graphs (see 
for instance Daudin et al. 2008). This model has been rediscovered many times 
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in the literature, and a non-exhaustive bibliography should include Frank and 
Harary (1982), Snijders and Nowicki (1997), Nowicki and Snijders (2001), Tall- 
berg (2005), Daudin et al. (2008), Mariadassou and Robin (2007), Zanghi ct al. 
(2008). To state it simply, this model assumes that each node belongs to some 
unobserved group. Conditional on the node groups, the (weighted) edges are 
independent and identically-distributed (i.i.d.) random variables, whose distri- 
bution depends on the groups of the nodes to be connected. As we are interested 
in GGMs, weighted edges correspond to entries of the concentration matrix. 

In this work, we aim at estimating a hidden structure, namely node groups, 
while discovering the network. This hidden structure should help us in choosing 
adaptive penalty parameters. Indeed, we wish to penalize the elements of the 
concentration matrix, according to the unobserved clusters to which the nodes 
belong. For instance, if two nodes belong to the same unobserved group, we wish 
to lower the penalty parameter acting on the corresponding entry in the concen- 
tration matrix. Conversely, if we increase the penalty parameters on the entries 
corresponding to nodes belonging to different groups, we shrink the estimated 
coefficient to zero. Our approach is completely new and improves inference of 
sparse modular networks. 

Another adaptive LASSO procedure is given in Zou (2006), whose idea is to 
lower the bias of the large coefficients by adapting the penalty parameter of each 
coefficient so that it automatically scales with the inferred value. It is known 
that the non-adaptive LASSO procedure may result in inconsistent parameter 
estimation. An illustration of the conflict between optimal prediction and con- 
sistent variable selection for the LASSO procedure is given in Meinshausen and 
Biihlmann (2006). They proved that the optimal penalty parameter for predic- 
tion gives inconsistent variable selection results, motivating the use of another 
penalty parameter to ensure the control of the probability of falsely connecting 
two or more distinct connectivity components of the graph. Like them, we also 
focus on optimal selection rather than on optimal prediction. The adaptivity of 
our procedure is not used for lowering the bias of large coefficients, but instead 
for constraining the prediction to fit the underlying structure of the graph. 

Model. Let us now briefly describe the general approach of our work. The 
model will be presented in detail in Section 2. Let X — (X\, . . . , X p ) T be a 
Gaussian random vector in W, with zero mean and positive definite covariancc 
matrix S, namely X ~ A/"(0 P ,S). We observe independent and identically- 
distributed (i.i.d) vectors (X 1 , . . . ,X n ) with the same distribution as X. The 
matrix K = S _1 is the concentration matrix of the model. Let S be the empirical 
covariance matrix. The log-likelihood of the observations is given by 

£(K) = - log det(K) - - Y,(X k ) J KX k +c=~ logdet(K) - ™Tr(SK) + c 

fe=i 

where c is a constant term. 

The ^-penalized estimator proposed by Banerjee et al. (2008) is given by 

K = argmax logdet(K) - Tr(SK) - p\\K\\ tl , (1) 
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where K >- stands for positive definiteness, p > is a penalty parameter and 

iiKik = E y i*y- 

A natural generalization of this approach is to have different penalty param- 
eters for different entries Kij. Namely 

logdet(K)-Tr(SK)-||p(K)|| £l , 

where p(K) = (flij{Kij))ij^-p is a matrix of penalty functions acting on each 
entry. As a general rule, using as many penalty functions as there are entries in 
the concentration matrix to be estimated is not meaningful. 

Here, we propose to take into account a hidden structure on the correlations 
between the coordinates random variables Xk- Thus, we consider latent i.i.d. 
random variables Zi, . . . , Z p with values in a finite set {1, . . . , Q}. Each variable 
Z j describes the state of Xi , and we wish to adapt the penalty function pij with 
respect to the states of Xi, Xj. More precisely, we wish to use a criterion of the 
form 

logdet(K)-Tr(SK)-||Pz(K)k, 

where p z (K) — (pz^z^ {^ij))i,je"P is a matrix of random penalty functions whose 
entries depend on the latent structure Z = Z i,..., Z p . However, the hidden 
structure is not supposed to be known, thus we cannot rely on the previous 
criteria. Intuitively, following the principle of Expectation-Maximization (em) 
algorithm of Dempster et al. (1977), the idea will be to replace the unobserved 
value p z (K) with its conditional expectation E(p z (K)|X 1 , . . . , X n ; K.( m >) under 
some model with parameter K^ m \ and iterate the following steps 

(e) Compute E(p z (K)|X\ . . . , X n ; RW) 

(m) Update K( m+1 ) = argmax K ^ E(p z (K)|X 1 , . . .,l n ;KH). 

One of our aims is to provide a very simple framework for such an analysis. 

Note that the ^i-norm used here acts on diagonal elements of the matrix K. It 
is counter-intuitive to penalize diagonal elements of the concentration matrix, 
as these do not reflect sparsity in the correlation structure. However, from a 
technical point of view, this strategy ensures that the procedure will select a 
positive definite estimator (see Remark 2). This point was not emphasized in 
the previous procedures using i\ penalized likelihood of GGMs. 

Road-map. In Section 2 we present the model and the penalized maximum- 
likelihood criterion on which we base our inference procedure, described in Sec- 
tion 3. This procedure relics on a variational em algorithm, combined with a 
LASSO-like procedure. Section 4 explains how Meinshausen and Biihlmann's 
approach may be interpreted as a penalized pseudo-likelihood method. Finally, 
Section 5 illustrates the performance of the method on synthetic data, for which 
an R-package, SIMoNe (Statistical Inference for Modular Network), can be down- 
loaded from the first author's website. We also test our algorithm on a real data 
set provided by Hess et al. (2006) and concerning n — 133 patients with breast 
cancer treated using chemotherapy. According to Hess et al. (2006) and Natow- 
icz et al. (2008), the patient response to the treatment can be classified either 
as a pathologic complete response (pCR), or as a residual disease (not-pCR). 
The prediction of the patient response is achieved accurately by studying the 
expression levels of a limited number of genes (j> — 26). Our algorithm is ap- 
plied on each class of patients (pcR and not-pCR) . Two distinct gene-regulatory 
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networks are thus inferred, showing a very different structure according to the 
selected class of patients. 

2. A latent structure model for network inference 

In this section we present a framework for modelling heterogeneity among depen- 
dencies between the variables. To this end, let us first recall classical notations 
from Gaussian Graphical Models (see Lauritzcn 1996, for elementary results 
about GGMs). 

2.1. Gaussian graphical models: general settings 

Let V = {1, . . . ,p} be a set of fixed vertices, X = (Xi, . . . , X p ) T a random 
vector describing a signal over this set and a sample (X 1 , . . . , X n ) of size n with 
the same distribution as X. 

The vector X is assumed to be Gaussian with positive definite covariancc 
matrix 2 = ,j)ev 2 - No loss of generality is involved when centering X, 

so we may assume that X ~ Af(O p , X). GGMs are based on a classical result, 
originally emphasized by Dempster (1972), claiming that any couple of entries 
(Xi,Xj) with i ^ j are independent conditional on all other variables indexed 
by V\{i, j}, if and only if the entry (E -1 )^- is zero. The inverse of the covari- 
ance matrix K = (iQj)(i,j)e'P 2 = X - , known as the concentration matrix, thus 
describes the conditional independence structure of X. Moreover, each entry 
7^ j is directly linked to the partial correlation coefficient r ij\v\{i,j} be- 
tween variables Xi and Xj. In fact, we have i"ij\v\{i,j} — — if y / y if n Kjj , and 
also Ka = Var(Xj|X-p\j) _1 . Hence, after a simple rescaling, the matrix K can 
be interpreted as the adjacency matrix of an undirected weighted graph Q rep- 
resenting the partial correlation structure between variables X\, . . . ,X p . This 
graph has no self-loop, with a random set of edges composed by all pairs (i,j) 
such that Kij ^ 0. Note that we are seeking only pairs of vertices (i, j) such 
that i < j, since there is no self- loop, and since = Kji. Inferring nonzero 
entries of K is equivalent to inferring Q, and is therefore a highly relevant issue 
in this framework. 

2.2. Providing the network with a latent structure 

Let us now extend the modeling by providing the network with an internal latent 
structure. 

The model proposed in Daudin ct al. (2008) attempts a better fit of data, as 
it places the network Q in the mixture framework, in order to take account of 
the heterogeneity among vertices. The same general mixture model is adopted 
here: vertices of V are distributed among a set Q = {1, . . . , Q} of hidden clusters 
that model the latent structure of the network. For any vertex i, the indicator 
variable Z iq is equal to 1 if i G q and otherwise, hence describing which cluster 
the vertex i belongs to. A vertex is assumed to belong to one cluster only, 
thus the random vector = (Zn, . . . , Ziq,) obviously follows a multinomial 
distribution. Namely, 

Zi~M(l,a), (2) 
where a = (aj., . . . , chq) is a vector of cluster proportions, so that a q = 1. 
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The concentration matrix structure. We shall now extend the clustering 
of vertices from V to the concentration matrix K. Accordingly, both the exis- 
tence and the weight of edges, described by the off-diagonal elements of K, will 
depend on the cluster each vertex belongs to. Conditional on the events i e ij 
and j € £ where q, I are clusters chosen from Q, each Kij (i ^ j) is a random 
variable whose probability density function is denoted by f q i, that is, 

Kij\ {Z iq Z je = 1} - f qf (-), i ^ j. (3) 

It will be remarked that in this formulation the variables are assumed to 
be independent, conditional on the clusters the vertices belong to. Moreover, 
we are considering only undirected graphs, so we may assume that f q £ = fi q . 
For technical reasons (see Remark 2), we also assume a prior distribution on 
diagonal elements of K, namely 

K u ~ /«,(•)• 

Our suggestion is to adopt Laplace distributions; hence 

V^R, f ql {x)= J-expj-M), and f (x) = exp J-M) , (4) 

where Aq^Ao > are scaling parameters and X q e = \p q . Below, the parameter 
Ao will be fixed and not estimated. 

The reason for choosing a Laplace distribution is that it is reminiscent of 
the ^i-norm, itself linked to LASSO-techniques for which appropriate tools are 
available. In fact, when considering the general penalized least-square problem, 
the penalty term can be seen as a log-prior density on the vector of parameters. 
In the case of LASSO, the prior distribution corresponding to the £i-norm is 
actually the Laplace distribution (see, e.g. Hastie et al. 2001). 

The affiliation model. The affiliation model is a special case of network 
structure (to be investigated below), where there are many different clusters, 
but where the focus is restricted to two types of edges: edges between nodes 
of the same cluster, and edges between nodes from different clusters. In the 
affiliation model the densities f q t in (4) are of only two kinds; that is, for all 
q,£eQ, let 

, _ J f qq = /i n (-; Ai n ) if q = £, the intra-cluster density of edges, 
q I fqi = /out (•; A ou t) if 9 A the inter-cluster density of edges. 



2.3. The complete likelihood 



Having described the modeling of the network, we now focus on the inference 
issue. 

We denote as X the nxp matrix that contains the data-set {X 1 , X 2 , . . . , X n } 
row-wisely organized, i.e., (X k ) T is the kth row of X. Furthermore, we denote 
as Z = {Z iq } i£ -p yq& Q the set of all latent indicator variables for vertices. For the 
sake of simplicity, the number of clusters Q and the parameters a. — (a q ) q eQ 
and A = {\ q i\ q ,i^Q are assumed to be known for the moment. 

The data experiments X are the only observations available, and from these 
we should like to be able to infer the graph Q of conditional dependencies or, 
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equivalently, nonzero entries of K. As the matrix K has been given a prior 
distribution, our aim is to maximize the posterior probability of K, given the 
data X, or equivalently, the logarithm of the joint distribution. The estimate is 
thus defined as follows: 

K = argmaxP(KIX) = argmax logP(X.K), 

K>0 K>0 

where K >- stands for positive-definiteness. 

To solve this problem, we place ourselves in the classical complete-data frame- 
work. The distribution of K is only known conditionally on the latent structure 
described by Z. We denote as Z the set of all possible clusterings over nodes 
from V. The marginalization over the latent clusters Z leads to 



K = argmax log ^ £ e (X, K, Z), 



where the so-called complete-data likelihood £ C (X,K,Z) = P(X, K,Z) is the 
function we shall develop using an EM-like strategy hereafter. For this purpose, 
a closed form of C c is required. 

Proposition 1. The following relation holds for the complete- data likelihood 

log£ c (X,K, Z) = - (logdet(K) - Tr(SK)) - \\p z (K)\\ ti 

- J2 Z iq Z jl \og{2\ ql )+ Z lq loga q + c, (6) 



g,eeQ 



where S = n 1 (X — X) T (X — X) is the empirical covariance matrix, c is a 
constant term and p z (K) = (pZiZj (Kij)) i . £V is defined by 



'** (7) 
otherwise. 



Proof. Using the Bayes rule, C c divides into three terms: 

log£ c (X, K, Z) - logP(X,K, Z) = logP(X|K) + logP(K|Z) + logP(Z), 
where we make use of the fact that logP(X|K, Z) = logP(X|K). 

The first term is the likelihood associated with a size-n sample of a multi- 
variate Gaussian distribution, since X <~ 7V(0 p , S). Routine computations lead 
to 

logP(X|K) = |logdet(K) - |Tr(SK) - 7 |log(2^). 
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As regards the second term, using the expression (4), we have 

log P(K|Z) = J2 Z « Z H lo S MKii) + E lo S M K «) 
q,teQ 

From (2), we have logP(Z) = J^i q Zi q \oga q , and the result follows. □ 
3. Inference strategy by alternate optimization 

In the classical EM framework developed by Dempster ct al. (1977), where X 
is the available data, inferring the unknown parameters K spread over a latent 
structure Z would make use of the following conditional expectation: 

Q (K|K< m) ) =E{log£ c (X,K,Z)|X;K( m) } 

= E P(z|X,K( m ))log£ c (X,K,Z) = E p ( z | K(m) ) log£ c (X,K,Z), 
xez xez 

(8) 

where K^ 1 ™) is the estimation of K from the previous step of the algorithm. 

The usual EM strategy would be to alternate an E-step computing the con- 
ditional expectation (8) with an M-step maximizing this quantity over the pa- 
rameter of interest K. Unfortunately, no closed form of Q (K.\K^ m n can be 
formulated in the present case. The technical difficulty lies in the complex de- 
pendency structure contained in the model. Indeed, P(Z|K) cannot be factor- 
ized, as argued in Daudin ct al. (2008). This makes the direct calculation of 
Q (K|K' m ') impossible. To tackle this problem we use a variational approach 
(see, e.g., Jaakkola 2000, for elementary results on variational methods). In 
this framework, the conditional distribution of the latent variables P(Z|K^ m )) 
is approximated by a more convenient distribution denoted by i? m (Z), which 
is chosen carefully in order to be tractable. Hence, our EM-like algorithm deals 
with the following approximation of the conditional expectation (8) 

E Rm {log£ c (X,K,Z)}= E i?m(Z)log£ c (X,K,Z). (9) 

zez 

In the following section we develop a variational argument in order to choose 
an approximation R rn (Z) of P(Z|K' m ^). This enables us to compute the condi- 
tional expectation (9) and proceed to the maximization step. 

3.1. Variational estimation of the latent structure (F,-step) 

In this part, K is assumed to be known, and we are looking for an approximate 
distribution R(-) of the latent variables. The variational approach consists in 
maximizing a lower bound J of the log- likelihood log P(X, K) , defined as follows: 



J (X, K, R(Z)) = logP(X, K) - V KL {i?(Z)||P(Z|K)} (10) 
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where T)kl is the Kiillback-Leibler divergence. This measures the difference 
between the probability distribution P(-|K) in the underlying model and its ap- 
proximation An intuitively straightforward choice for R(-) is a completely 
factorized distribution (see Mariadassou and Robin 2007, Zanghi ct al. 2008) 

R T (Z) = Y[h Ti (Z i ), (11) 

iev 

where h T . is the density of the multinomial probability distribution .M(1;t;), 
and Ti — (th, . . . ,Tig) is a random vector containing the variational param- 
eters to optimize. The complete set of parameters r = {r iq } ieV qeQ is what 
we are seeking to obtain via the variational inference. In the case in hand the 
variational approach intuitively operates as follows: each Ti q must be seen as 
an approximation of the probability that vertex i belongs to cluster q, condi- 
tional on the data, that is, r, g estimates V(Zi q = 1|K), under the constraint 
Tlq T iq = 1- I n t ne ideal case where P(Z|K) can be factorized as J| i P(Zj|K) 
and the parameters Ti q are chosen as Ti q = V(Zi q = 1|K), the Kiillback-Leibler 
divergence is null and the bound J reaches the log-likelihood. 

The following proposition gives the form of the lower bound J to be maxi- 
mized in order to estimate r. 

Proposition 2. Let us assume that R T can be factorized as in (11), and let us 
denote J T (X,K) := J (X, K, Rr(Z)). Then J T satisfies the following expres- 
sion 

J-r (X, K) = C - ^ T iq log T lq + ^ T iq lo S a q 

iev iev 
qeQ qeQ 

-\\p r (K)\\ ti - n qTjl log 2A g/) (12) 

q,eeQ 

where c does not depend on r and p T (K) = (p T<x - (Kij))i,jev 2 * s defined simi- 
larly as (7), replacing Zi q by Tiq. 

Proof. Starting from (10), classical results on variational methods show that 

where Tt(R T (-)) is the entropy of the distribution R T (-) and Q T (K) is the ap- 
proximation of the complete log-likelihood conditional expectation, computed 
under the distribution R T . Namely, 

g T (K) = E^{log£ c (X,K,Z)}= ]Ti? x (Z)log£ c (X,K,Z). (13) 

7,ez 

In the special case of factorized distribution (11), the entropy is 

H(RAZ)) = J2 n ( h ^)) = - zZ n q \ogr iq . 

iev iev,qeQ 

Moreover, 

Q T (K) = logP(X|K) + Epilog P(K | Z)] + E flT [log P(Z)]. 
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Equation (12) follows via Proposition 1, by using that E,R^(Zi q ) = n q and 

The optimal approximate distribution R T is then derived by direct maxi- 
mization of J T . The following proposition gives the estimate f that solves the 
problem. 

Proposition 3. Let a and A be known. The following fixed-point relationship 
holds for the optimal variational parameters T = argmax.,- J T 



n 



J6P\{<} 



' exp( }) , CI4) 



where oc means that there is a scaling factor such that for any i G V , we have 

Y, q % = 1. 

Proof. This is just an adaptation to the Laplace case of Mariadassou and Robin 
(2007, Proposition 3). □ 

The initial value of r is chosen using a classification algorithm such as spectral 
clustering (see for instance Ng ct al. 2002). As a consequence, the initial values 
for Ti q lie in {0, 1}. We then use an iterative procedure setting f^ 7l+1 '> = g(T^ m '), 
where g is the function (implicitly defined above) for which r is a fixed point. 
Note that we cannot ensure uniqueness of the fixed point for g, nor convergence 
of this iterative procedure. In practice, we can always use a maximal number 
of iterations, and if convergence has not occurred, we keep the initial value of 
t given by the clustering method. In appendix A. 2 we explain that at least in 
the affiliation model (5), if the current values of the precision matrix are 

small enough, and if the penalty parameters Aj^ 1 and A",^ are well-chosen, then 
uniqueness of the fixed point is ensured. However, such a result does not hold 
in the general case, which is one of the drawbacks of of the variational approach 
in this context. 



Estimation of ex and A. The parameters a and A have been previously 
considered as known to keep the statement as clear as possible. 

Two different strategies may be used with respect to these parameters. The 
first approach is to fix their values. Fixing the value of a comes down to choos- 
ing a priori the proportions of the groups, which is quite a common strategy 
in mixture models. As for the choice of A, this is equivalent to choosing the 
penalty parameter in the classical LASSO. Concerning general parameters A, a 
number of values need to be determined, which might be a problem. However 
in the particular affiliation model (5), only 2 parameters have to be fixed: a pa- 
rameter A; n that corresponds to a light penalty, since many intra-cluster edges 
are expected, and another parameter A out that fits with a heavier penalty, since 
we do not expect many inter-cluster edges. This is typically the kind of strategy 
that will be used for numerical applications (see Section 5) . More generally, the 
matrix penalty can be tuned to obtain a desired quantity of inferred edges, or 
to constrain the topology of the graph, e.g. graphs with hubs. 

The second strategy is to make use of the current inferred graph to estimate 
the parameters. The basic idea is to include this estimation in the variational 
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method. Unfortunately, the maximization of J T given in equation (12) with 
respect to r, A and a at the same time is not possible. To tackle this problem, 
we use an alternate strategy. The parameter t is computed with the fixed- 
point relationship (14) for fixed values of A and a. Then we maximize J7i- with 
respect to A and a, once R T is fixed (that is, once t is fixed), as in the following 
proposition. We successively iterate these two steps until stabilization. 

Proposition 4. For fixed values of t, the parameters a, A maximizing J T are 
given by 

vq,i eg, a q = - 2_^T lq and \ ql = — = — — . 



Proof. Once terms that do not depend on the parameters of interest have been 
removed from J7t-, the problem becomes 

a q = argmax V] r lq log a q and \ qe = argmax - Y~] T lq Tji f + log 2\ q£ ) . 

Null-differentiation with respect to a q (under the constraint ^ a q — 1) and 
X q e leads straightforwardly to the result. □ 



3.2. A Lasso-like method to estimate the concentration matrix (the 
M-step) 

Now that we are able to compute the approximate conditional expectation 
Qt-(K) defined by (13), we wish to infer the concentration matrix K, assuming 
r is known. This is the aim of the M-step of our EM-like strategy, that deals 
with the maximization problem argmaxj<>o Qt-(K). 

Using Proposition 1 and the equality E^ T (Z jg Zj^) = Ti q Tj£, it is a simple 
matter to rewrite the problem as follows 

K = argmax (log det(K) - Tr(SK)) - \\p T (K)\\A . (15) 

K>0 L 1 > 

Hence, our M-step can be seen as a penalized maximum likelihood estimation 
problem, exactly like in Friedman et al. (2008), Banerjce et al. (2008). The 
likelihood considered here is P(X|K), that is, the likelihood which corresponds 
to the n realizations of the Gaussian vector X for a given concentration matrix 
K. The difference of our approach lies in the complexity of the penalty term, 
and in slight discrepancies as regards some constant factors. 

Remark 1. Since we are using a penalty term 1/A on matrix K's diagonal 
elements, the solution to (15) satisfies 

VieT, K^ 1 = S ii + 2/(nA ), (16) 

when Aq 1 < n\Su\/2 for any i G V . Indeed, the sub-gradient equation is 
n/2(K^ i 1 — So) + sgn(Ku) / \q = 0, and Ka > since it is the inverse of a 
conditional variance. 

Let us now look at the solution of the M-step: the following proposition gives 
an equivalent formulation of (15) that is more likely to be solved. The result 
draws its inspiration from Banerjce et al. (2008). 
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Proposition 5. The maximization problem (15) over the concentration matrix 
K is equivalent to the following, dealing with the covariance matrix S 

£ = argmax logdet(S), (17) 

||(s-s)./p t |U<i 

where ■/ is the term-by-term division and 

P -(P ) „ with P - J 2n_1 ^Vj^ 1 Ml, 
Px-t^W toitft ^r, -| 2(nA )- 1 i = j. 

Remark 2. By penalizing the diagonal terms of the concentration matrix K in 
the initial problem, the set of matrices X over which we maximize our criterion 
contains, for instance, the matrix S + 2/(nAo)/, (where / stands for the identity 
matrix). Thus, provided that the value of penalty parameter 1/Ao is set suffi- 
ciently high, this set contains positive definite matrices. This ensures that our 
estimator is always invertible. Obviously, when S is invertible, which is usually 
true for n greater or equal than p, penalizing the diagonal terms becomes futile. 
In this case 1/Ao is set to zero. 

Proof. The penalty term in (15) can be written as follows 

\\PAK)\\ (1 = £ J2 l f^r iq r ji + J2^= £ IIT^K^, 

where * is the term-by-term product. The set \^qt} q i^Q contains p x p sym- 
metric matrices, defined, for each couple (g, £), by 

T> = (Tqe-.ij),, 7 - e -p with Vi ^ j, T qi . tij = T^Ill an d T qtAi = t— ^o- 

Let us now use the fact that || A\\i 1 = max|| U || oo <i Tr(AU), for a given matrix 
A. The optimization problem (15) can now be written as 



max min i - logdetK - Tr -SK + V (T ai + K) U 
K^o { u 3i :||tMU<i}| 2 6 I 2 v j^Q 

since the trace operator is linear. The dual version of the above expression is ob- 
tained by swapping max and min. The maximization is solved by differentiating 
with respect to K. To do this, we recall that in our specific case the matrices 
T are symmetrical, and thus Tr((T*K)U) = Tr(K(T*U)). Then, applying 
the usual rules for the derivative of the trace operator, null-differentiation with 
respect to K yields 

S := KT 1 = S + 2 Y (LV * TV) . (18) 
qMQ 

The dual problem therefore becomes 

f n . . np \ 
mm { — — logdct(S) — > , 
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or in other words, 

max logdet(S). 

{U«*:||U«<Hoo<l} 

Finally, we need to write the constraint as a function of X rather than the set 
{Ug^}. In fact, we simply need to show that 

{IV.Vg.fe CIHMoo < 1}={'£;\\('£-S)-/P t \\ oo < l}, 

which is straightforward (see Appendix A.l for details). □ 

To solve (17) and thus obtain the estimate S, we successively use two coordi- 
nate descent methods. The first corresponds to a block-wise strategy suggested 
by Banerjee et al.. The second one is used to solve the resulting LASSO problem 
and was suggested by Friedman et al. (2007). 

Let us first explain the block- wise strategy. For this purpose, we introduce 
the following notation for X, S and the penalty matrix P T 

V* _ ^11 a 12 

^ — -J V 

where E n , Su andPn are (p— l)x(p— 1) matrices, <Xi2, si 2 and P12 are (p— 1) 
length column vectors and £ 2 2 , S22 and P 22 are real numbers. We have already 
remarked (Remark 1) that the solution to (17) satisfies E 22 = S22 + 2/(nAo). 
Moreover, using Schiir complement, the vector <Ti 2 satisfies 

= argmin (y T S u y) . (20) 

{y:||(y-si2)-/pi2lU<i} L > 

We have det(S) = det(Sn)(S 22 — <t[ 2 £ 1:l ctyi)- The full matrix £ is approx- 
imated in the following way: first, if required when p is greater than n, we 
initialize the procedure with S + 2/(nXo)I, where Ao > is chosen so as to make 
S + 2/(nXo)I invertible; secondly, we permute the columns (and thus the rows) 
of £ and iteratively solve problems like (20) until convergence of the procedure. 
This convergence is ensured by the following lemma. 

Lemma 1. The procedure which starts with a positive definite matrix and iter- 
atively updates the columns and rows of this matrix according to the solutions 
of (20) converges to the solution £ of (17). 

Proof. The proof relies on Banerjee et al. (2008, Theorem 3) and Tseng (2001, 
Theorem 4.1). Convergence of block-coordinate descent methods is a well-documented 
topic in convex optimization literature. Here, we have to bear in mind that using 
£i-norm penalty leads to non-differentiable functions. Thus, we rely on a result 
by Tseng (2001, Theorem 4.1), which in our case ensures the convergence of the 
procedure, provided there is at most one solution to each minimization problem 
(20). This point is proved in Banerjee et al. (2008, Theorem 3). □ 

Then, starting from a result given in Banerjee et al. (2008), an interpretation 
of (20) as an ^-penalized problem is given in Friedman et al. (2008). This l\- 
penalizcd problem is reminiscent of the LASSO and may thus be solved using a 
coordinate descent strategy (Friedman et al. 2007). The following proposition 



S12 

S22 



Pr = 



Pll 
Pl2 



Pl2 

-P22 



(19) 



C. Ambroise, J. Chiquet and C. Matias/ Inferring sparse GGM with latent structure 16 



enunciates a result similar to those obtained in Banerjee et al. (2008, equation 
(6)) and Friedman et al. (2008, equation (2.4)), although with a more general 
penalty term and a factor \ that differs. Since none of these articles gives an 
explicit proof for this result, it is fitting that we provide our own proof here. 

Proposition 6. Solving (20) is equivalent to solving the dual problem 



(3 = argmin 

/3 



1^1/2 1/2 

2 S n @ ~ S ll S 12 



2 



|Pl2*/%, (21) 



where solution cr\i to (20) and f3 to (21) are linked through 

B 12 = S n 3/2. (22) 
Proof. Problem (20) can be written as follows, by splitting the constraint: 



-l 



min y y T S n y 

subject to -(pia)i < 2/i - (si2)i - (pi2)i < 0, Wi = 1, . . . ,p - 1, 
or -(pi2)i < ~Vi + (si 2 )i - (Pi2)i < 0, Wi= l,...,p- 1. 

Let us introduce L the so-called Lagrangian, with vectors of Lagrange coeffi- 
cients denoted by [3 1 = (/3*)i<p-i, 1 = (/3f)i< p -i with nonnegative entries. 
Also, let /3 = /3 2 — /3 1 . The Lagrange version of the above problem is 

mm {y T £ny + maxL(/3)| , (23) 

where, in the present case, L is given by 

L(f3) = £ ft ( w - (s 12 )i - (p 12 )i) + A' (~Vi + " (Pw)0 . 

i i 

The coefficients Z^ 1 and (3f maximizing L(f3) are null when the constraints are 
satisfied, and for each index i, at least one coefficient among {f3l,/3f} is zero. 
Then 

Meanwhile, consider the dual problem of (23), swapping min and max: the 
solution that minimizes the dual problem with respect to y satisfies the null- 
gradient hypothesis. We obtain 2£ n y — [3 = 0, that is y = |Sn/3 (which 
proves equation (22)). Introducing this result in the dual of (23), we get 





also equivalent to 



max- J/3 T Sn/3 + s} 2 /3 - £ (ft + ft) ( Pla ) 4 



min VsuyS-sj^+Hpra*^ 



J8 4' 

Expressing this quantity by using the Euclidean norm achieves the proof. □ 

Hence, the column a 12 of the estimated covariance matrix S is computed by 
solving the LASSO problem (21) using another coordinate descent method. 
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Lemma 2. The solution to (21) is computed by updating the jth coordinate of 
(3 via 

% = 2S ( (s 12 ) j - \ ^(S u )iA ; (Pia)j /(Su)i;, (24) 

where S{x; p) = sgn(a;)(|a;| — p) + is the soft-thresholding operator. 

Moreover, the procedure which iteratively updates the entries of vector er 12 = 
Sn/3/2 according to the solutions (3 of (24) converges to the solution of (20). 

Proof. The proof of this lemma is postponed to Appendix A. 3. □ 

Finally, the estimate of the matrix of concentration K is recovered by in- 
verting £, which can be done at low computational cost (see appendix A. 4 for 
details). Hence, we solve the initial maximization problem (15) that defines the 
M-step of our algorithm. 

Implementation of the full em algorithm is outlined in Algorithm 1 . 



Algorithm 1: The full EM-likc algorithm 

while <2 x (K( m )) has not stabilized do 

//THE E-STEP: LATENT STRUCTURE INFERENCE 
if m = 1 then 

// First pass 

Apply spectral clustering on the empirical covariance S to initialize t 

else 

Compute t with the fixed-point relationship (14), using K^" 1 " 1 ' 

//THE M-STEP: NETWORK INFERENCE 

Construct the penalty matrix P according to t 

— (to) 

while S has not stabilized do 

for each column of S do 

Compute <Ti2 by solving the LASSO-like problem with path-wise coordinate 
_ optimization 

— { ■) — ( m ) 

Compute IC m J by block inversion of S 

_ m <— m + 1 



3. 3. Choice of penalty parameters 

As previously stated, the penalty parameters A may be estimated in the E-step 
of the algorithm (see subsection 3.1). However, this choice is not necessarily 
optimal for the estimation of K, and other choices might in practice lead to a 
better solution. A good strategy is to keep the estimated value of A in the E-step 
that leads to the estimation of r, and to impose another value of A during the 
M-step. In this part, we indicate a possible choice for the penalty parameters to 
use in the M-step, ensuring a small error on the connectivity components of the 
estimated graph. 

Let us first introduce some notation. For any node i E V, let Ci denote the 
connectivity component of node i in the true underlying conditional dependency 
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graph, and Cj the corresponding component resulting from the estimate K of 
this graph structure. The following proposition is based on Meinshausen and 
Biihlmann (2006, Theorem 2) and Banerjee ct al. (2008, Theorem 2). 

Proposition 7. Fix some e > and choose the penalty parameters X such that, 
for all q,i € Q, 



2p 2 F n _ 2 ( -|- f : >mx.S'„.s; ;; - 2- j (ti - 2) 1 ' 2 \ < e, (25) 




where 1 — F n -2 is the c.d.f. of Student's t- distribution with n — 2 degrees of 
freedom. Then 

V(3k,d k <£ C k ) <e. (26) 

Proof. Here we simply indicate the main differences between the proof of Baner- 
jee et al. (2008, Theorem 2) and what is valid in our context. Note that according 
to (15), the estimator K must satisfy the following sub- gradient equation 



where My S sgn(Kij). Following the proof of Banerjee et al. (2008, Theorem 2), 
we easily get 

V(lk,d k £C k )<p 2 max pf^|>V^^ 
Performing some computations involving the correlation between variables Xi 

l/2\ 



and Xj, we also obtain 



/ 2(n-2) 1 / 2 I 1 

¥(3k, C k £ C k ) < 2p 2 max F„_ 2 ^ max S u S n - 

!/£S 1 n\ ql y iev ,j^c z A;^ 

which entails the conclusion. □ 

Remark 3. Following Banerjee ct al. (2008), note that in order to ensure (25), 
it is enough to choose the penalty parameter A such that, for all q,£ E Q, 

V(e)>^(rc-2 + 4- 2 (^r)) (maxS^) *»-2(^a) , 

where t n ^2(u) is the (1 — w)-quantile of Student's i-distribution with (n — 2) 
degrees of freedom, i.e. F n -2{t n -2{u)) = u. 

Remark 4. Inequality (25) does not take into account that different penalty 
parameters are used for different hidden classes q, £ E Q. An adaptation of the 
preceding strategy is to use current values Z^" 1 ) obtained from the probabilities 
T (™) f th e hidden classes and to choose the current penalty parameters X^ m ' 
accordingly. More precisely, let us set, for instance 

(m) _ / 1 if q = argmax f 



^ iq I otherwise 
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Then, when 



2p 2 F n 



-1/2 



V 



(m) 
ql 



\z {m) z^ 



(n-2) 



1/2 



<e, (27) 



/ 



for all q,( <E Q, the current estimate K> m ) of the dependency graph will approx- 
imately satisfy (26). Moreover, in order to ensure (27), it is enough to choose, 
for all q,£eQ, 



e 

v 



1/2 



/ 



-1/2 



max SaSjj 



\ <-q it 



I 



e 

v 



(28) 

Typically, the kind of values obtained with (28) will lead to large penalties 
and, consequently, to very sparse graphs: practically, more informative networks 
can be obtained by replacing the term e/2p 2 in (28) by greater values. In any 
cases, (28) should be seen as a starting value. 



4. Link with Meinshausen and Biihlmann's approach 

We should also like to fill the gap between, on the one hand solving (15) and, 
on the other, the approach proposed in Meinshausen and Buhlmann (2006), 
where p independent penalized regression problems are solved using the LASSO. 
In fact, we shall show that Meinshausen and Biihlmann's approach is equivalent 
to maximizing the penalized pseudo log-likelihood corresponding to the size- 
n sample of the multivariate Gaussian vector X on the set of non symmetric 
matrices. Let us denote as C this pseudo- likelihood, defined by 

log£(X;K) =J2 (i>g p (*M\i; K *)) . 

ig-p \fc=i / 

where X^,^ is the fcth realization of the Gaussian vector X, once the ith coor- 
dinate has been removed. In this section, the £]-norm of matrices is restricted 
to off-diagonal elements only, that is, ||A||^ 1 = J2i^j 

Proposition 8. Consider the solution K pseudo to the penalized pseudo-likelihood 
problem 

K pseudo = argmax log£(X;K) - ||P *K||, 1 , (29) 

(whose diagonal is fixed) and the solution K MB given in Meinshausen and 
Buhlmann (2006) to thep different regression problems, using the matrix penalty 
2P/n. The two solutions have exactly the same null entries. 

Proof. Denote by Kuw and Suu, respectively, the matrices K and S once 
their ith row and zth column have been removed. Moreover, K,-w and S,-w are 
the zth rows of the matrices with the zth term removed. After some routine 
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computations, and using classical results for Gaussian multivariate vectors (see 
Appendix A. 5), it can be shown that 

log£(X; K) = ( lo & K ™ ~ K iiS» " 2S iV K Ai - -Lk^S^KJ 

(30) 

where c does not depend on K. Thus, if we forget the symmetry constraint 
on K, maximizing the pseudo- likelihood (30) with respect to the non-diagonal 
entries of K is equivalent to p independent maximization problems with respect 
to each column Kj^. Consider, for instance, the last column of K, that is, for 
i = p, and the relative term in (30). This term can be written as 

(2ir 22 sI 2 K[ v + Kj v S 11 K Ai , 



2K 7 



2K- 



22 



S ll 2K J\i + •^22S 11 1/2 S 12 



2 



+ c, 



where we use the block- wise notation defined above (19). The term C does not 
depend on Kj\j, which is the current column of the concentration matrix to 
infer. Namely, d = — if| 2 s[ 2 S^ L1 1 Si2. 

Consider now the penalized version of the log-likelihood (29): we wish to solve 
p penalized problems of minimization as defined above, which can be written as 
follows 



mm 
/9 



S\{ 2 P + X 22 S u 1/2 si 2 |[ + ^ ||p 12 */3\\ ti . (31) 



Mcinshauscn and Buhlmann wish to solve p LASSO-problems, for instance for 
the last variable p, 

1 i 
min - ||Xp - X\_a + ||2n _1 p 12 * all , (32) 

an 1 

where X p is the pth column of X and X\ p is the matrix of data the pth column 
has been removed (note that we adapted the penalization term corresponding 
to the framework developed here). 

The minimum is reached in (31) for null-differentiation, and we get 

2Sll/3 + 2^228^ + — - Pl2 *v = 0, 

n 

where v £ sign(/3). The same for (32), and we get 

^X( p X Vp « - ^XTX Xp + 2n" W 7 = 0, 

where 7 €E sign(a). Now, just note that n _1 Xy p X\p = Sn and n^ 1 ^KJ l 'X.\ p = 
s{ 2 , and problems (31) and (32) are equivalent, provided that a = — /3/X 22 . 

Thus, the columns of the concentration matrix (with a removed diagonal 
term) inferred from the penalized maximum pseudo-likelihood problem (29), 
and those inferred with Mcinshausen and Buhlmann's approach, share exactly 
the same null-entries, that is, the same network of conditional dependencies. □ 
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5. Numerical experiments 

In this section wc present numerical experiments on both synthetic data, to in- 
vestigate how well the proposed selection procedure behaves, and real data, to 
demonstrate the practical use of GGM covariance selection with latent struc- 
ture. In the remainder of this section we focus on an affiliation model (5), the 
choice of the penalty being made in line with Section 3.3. More precisely, we fix 
the ratio Ai n /A ou t = 1.2 and either let the value 1/Aj n vary when considering 
precision/recall curves for synthetic data, or fix this parameter according to (28) 
when dealing with real data. 

5.1. Synthetic data 

We perform numerical experiments to assess the performance of our approach 
(SIMoNe, Statistical Inference for Modular Network) and compare it to already 
existing methods for GGM covariance selection: GLasso (Friedman et al. 2008) 
and GeneNet (Schafer and Strimmer 2005). 

Data synthesis in our framework requires the simulation of a structured sparse 
inverse covariance matrix. To this aim, we first simulate a graph with an affili- 
ation structure. We consider a simple binary affiliation model where two types 
of edges exist: edges between nodes of the same class and edges between nodes 
of different classes. The binary incidence matrix of the graph is transformed 
by randomly flipping the sign of some elements in order to simulate both pos- 
itively and negatively correlated variables. Positive definiteness of this matrix 
is ensured by adding a large enough constant to the diagonal. The matrix is 
then further normalized to have a diagonal of ones. A Gaussian sample of size 
n with zero mean and the above covariance matrix is then simulated 50 times. 
The results we present below are averaged over the 50 samples. At the end of 
this section we discuss the performances of our method when there is no latent 
structure on the data. 




(a) (b) (c) 

Fig 1. Simulation of the structured sparse concentration matrix. Adjacency matrix without 
(a) and with (b) rows and columns reorganized according the affiliation structure and corre- 
sponding graph (c). 

We simulate sparse graphs with p = 200 and n from 100 to 2000 (n/p E 
{1/2,2,3,6,10}). We use a probability of intra-cluster connection of 0.125, a 
probability of inter-cluster connection of 0.0025, Q — 3 groups and equal group 
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proportions on = 1/3. With these settings, the theoretical expected number of 
edges is about 862 and the total number of potential edges is 19900. A sample 
graph is given in Figure 1. The running times of GLasso and SIMoNe are of the 
same order. For the settings described above the running time varies from a few 
seconds to a few minutes, according to the penalty parameter. 

We focus the experiments on the ability to recover existing edges of the net- 
work, that is the nonzero entries of the concentration matrix. This is a binary 
decision problem where the compared algorithms are considered as classifiers. 
The decision made by a binary classifier can be summarized using four num- 
bers: True Positives (TP), False Positive (FP), True Negatives (TN) and False 
Negatives (FN). We have chosen to draw precision/recall curves to display this 
information and compare how well the methods perform (Figure 2). 

Precision (TP /(TP + FP)) is the ratio of the number of true nonzero el- 
ements to the total number of nonzero elements in the estimated concentra- 
tion matrix K. Recall that (TP/(TP + FN)) is the ratio of true nonzero el- 
ements in K to all nonzero entries of the real concentration matrix K. In a 
sparse context where the number of actual positives (TP + FN) is small com- 
pared to the number of actual negatives (FP + TN), precision/recall curves 
give a more informative picture of an algorithm's performance than classical 
Receiver Operator Characteristic (ROC) curves. Indeed, ROC curves plot the 
False Positive Rate (FPR = FP/(FP + TN)) against the True Positive Rate 
(TPR = TP / (TP + FN)). When the number of total positives is small com- 
pared to the number of total negatives, small variations of FP and TP will 
result in small variations of FPR and large variations of TPR, which is not 
relevant for comparing performances. In a statistical framework, the recall is 
equivalent to the power and the precision is equivalent to one minus the False 
Discovery Proportion. 

Additionally to the GLasso (Friedman et al. 2008) and GeneNet (Schafer and 
Strimmer 2005) we consider two other procedures: 

• When n is greater than p, a straightforward way to obtain an estimate of 
the inverse covariance matrix is to invert the empirical covariance matrix. 
Although this approach is unlikely to perform well in a selection context 
(since it is designed for estimation purposes), it is worth comparing it to 
its competitors in order to assess the scale of improvement. We call this 
procedure InvCor. 

• When the latent structure Z of the concentration matrix is known, our 
method can be applied without its E-step and produce a relevant selec- 
tion of the nonzero entries of the concentration matrix. This approach 
represents the upper limit of our method, since it makes use of an usu- 
ally unavailable source of information. This procedure is denoted perfect 
SIMoNe. 

In some problems the latent structure of the graph is partially known and 
this information can be used in the E-stcp to improve the estimation of the 
latent structure. For example, when inferring gene regulation networks, a 
subset of identified genes may be known to belong to the same functional 
module. 

The approach of Meinshauscn and Buhlmann (2006) was also tested. The 
principle of this approach, and the performances obtained are close to those of 
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GLasso, but it was always slightly outperformed. We have therefore decided, for 
the sake of brevity, to report only the four previously described procedures. 

For the methods based on penalization (GLasso, SIMoNe and Perfect SIMoNe), 
the precision/recall curves are plotted by varying the penalty parameter (namely 
1/Ai n in our case). The penalty parameter varies from close to zero to a maxi- 
mum value which forces all off-diagonal elements of K to be null (see Appendix 
A. 6). The GeneNet and InvCor methods are plotted by sorting the elements of 
K according to their absolute values, and choosing different thresholds to find 
nonzero entries. 

Even when n is really greater than p (Figures 2 (a-b)) Invcor is always 
dominated by the other methods from a selection point of view. This simple 
check shows that even in a favorable context with abundant data, penalization 
procedures improve the selection of nonzero entries of the concentration matrix, 
in comparison with methods based on estimation of these entries. 

Although GeneNet and GLasso can provide different results on a given run, 
both methods perform similarly on average (50 runs for our experiment). The 
only parameter we change in this experimental setting is the n/p ratio. 

Perfect SIMoNe's curves dominate all other curves for any n/p ratio. This 
clearly shows that the knowledge of the structure provides a valuable informa- 
tion for selecting the nonzero entries of the concentration matrix. When the 
structure is hidden, the main problem of our approach is then to find a reliable 
estimate of this structure from the initial data. 

Perfect SIMoNe and SIMoNe perform equivalently when n — lOp and when 
the ratio n/p decreases, Perfect SIMoNe tends to outperform SIMoNe more 
clearly. This means that SIMoNe is able to recover the latent structure when 
there is enough data, but does not find a substantial structure when n drops 
below p. 

When p > n, the empirical covariance matrix ceases to be invertiblc. Thus, 
Figures 2 (e-f) do not display the InvCor results. Although it is possible to show 
that both GLasso and SIMoNe increase the number of inferred true nonzero 
elements with the number of iterations in all settings, precision/recall curves 
show the relative poor performances for all tested algorithms when p > n. 

Notice that when p > n, the estimated latent structure is not reliable. Never- 
theless, the performance of SIMoNe remains comparable to that of GLasso. We 
can therefore see that assuming the existence of a latent structure when there 
is none does not impair the selection of nonzero entries of the matrix K. 

5.2. Breast Cancer data 

We tested our algorithm on a gene expression data set provided by Hess ct al. 
(2006) and concerning 133 patients with stage I — III breast cancer. The pa- 
tients were treated with chemotherapy prior to surgery. Patient response to the 
treatment is classified as either a pathologic complete response (pCR) or a resid- 
ual disease (not-pCR). Hess et al. (2006) and Natowicz et al. (2008) developed 
and tested a multigene predictor for treatment response on this data set. They 
focused on a set of 26 genes having a high predictive value (see Table 1). We 
thus consider a total of n = 133 cases containing p = 26 gene expression levels. 

When dealing with gene regulatory networks, we typically observe n inde- 
pendent microarray experiments, each giving the expression levels of the same 
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Fig 2. Precision/recall curves comparing the performance of GeneNet, GLasso, SIMoNe and 
perfect SIMoNe, when inferring the structure of a simulated graph with p = 200 variables. 
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p genes. If the same experimental conditions are used for all microarrays, these 
may be considered as a sample of the same experiment. In the application in 
question, cases from the pCR class (34 cases) and from the not-pCR class (99 
cases) clearly do not have the same distribution. We apply our algorithm on 
each class of patients. Two distinct gene regulatory networks are thus inferred. 

Figure 3 plots the resulting networks obtained for three different penaliza- 
tions. The penalization parameters were heuristically chosen from the number of 
expected nonzero entries. We used Q — 2 latent clusters, and it is interesting to 
note that when assuming more than two clusters, the algorithm systematically 
produces exactly two non-empty clusters. 

The inferred networks exhibit very different structures according to the class 
of patients. This in itself is interesting and suggests that gene regulation differs 
with respect to the presence or absence of a pCR. 

The network obtained with not-pCR cases displays a two-star pattern. Each 
star connects to a unique gene, either SCUBE2 or IGFBP4. Almost all the most 
significant connections imply SCUBE2. This star pattern suggests that further 
studies of this particular gene would be of interest for understanding residual 
disease. 

The network estimated with the pCR cases has a different two-cluster struc- 
ture. In particular, it groups IGFBP4 and SCUBE2 in the same cluster with a 
direct significant link. This again indicates a completely different relationship 
between the genes in pCR versus non-pCR. 



Gene symbol 


Gene name 


MAPT 


Microtubule-associatcd protein 


BBS4 


Bardet-Biedl syndrome 4 


THRAP2 


Thyroid hormone receptor associated protein 2 


MBTP-S1 


Hypothetical protein 


PDGFRA 


Human clone 23,948 mRNA sequence 


ZNF552 


Zinc finger protein 552 


RAMP1 


Receptor (calcitonin) activity modifying protein 1 


BECN1 


Bcclin 1 (coiled-coil, myosin-likeBCL2 interacting protein) 


BTG3 


BTG family, member 3 


SCUBE2 


Signal peptide, CUB domain, EGF-like 2 


MELK 


Maternal embryonic leucine zipper kinase 


AMFR 


Autocrine motility factor receptor 


CTNND2 


Catenin, delta 2 


GAMT 


Guanidinoacetatc N-mcthyl transferase 


CA12 


Carbonic anhydrase XII 


FGFRIOP 


FGFR1 oncogene partner 


KIAA1467 


KIAA1467 protein 


MTRN 


Meteorin, glial cell differentiation regulator 


FLJ10916 


Hypothetical protein FLJ10916 


E2F3 


E2F transcription factor 3 


ERBB4 


V-erb-a erythroblastic lcukcmiaviral oncogene homolog 4(avian) 


JMJD2B 


Jumonji domain containing 2B 


RRM2 


Ribonucleotide reductase M2polypcptide 


FLJ12650 


Hypothetical protein FLJ12650 


GFRA1 


GDNF family receptor 1 


IGFBP4 


Insulin-like growth factor binding protein 4 




Table 1 




The key genes that composed the inferred networks. 
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Fig 3. Inferred graphs for three different penalization's levels. 
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Appendix A: Appendix section 

A . 1 . Proof of the equivalence between the constraints 

When ||Ug£|| < 1, we have for each couple G V 2 , 

q,e q,e 

Thus HU^IU < 1 ^ ||(S - S) - /P^Hoo < 1- 

On the other hand, assume that ||(S — S) • /P-rlloo < 1, that is, for all i, j <E 
V, we have 

-P TiTj < (S - S), • < P TiTj . 
This also means that there exists some Sij € [0,1] such that 

(S - S) tf = SijP^ + (1 - SiM-P^,) = I £(25y - l)^. 

11 q,i 

We choose XJ q i such that (U q i)ij — (2Sij — 1) for all qj, e Q. Then, since 
e [0, 1], we have 

-i<(U^)y<i, Vi.jeP, 

which proves that ||(S - S) • /P-rW^ < 1 => HlVH^ < 1. 
A. 2. Fixed-point study 

Let us first introduce some notation. For any i,j£V and any q, £ € Q, consider 
the random variables 

\Kii\ 

L ijqt = +log2A 9£ . 



Let u : R py — » be defined by its coordinate functions u — (u iq ) ie -p_ q<£ Q in 
the following way 



Va = {a iq ) ie v, q eQ € 

u iq (a) = a q exp j - ^ ^ a jZ L ijqt\ 

Ki. 
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and let g = (g iq ) ie v, q eQ ■ ^ pQ -> ^ pQ satisfy 
Va e M pQ , ft,(o) 



According to Proposition 3, the optimal parameter r is a fixed-point of g. 
Now, let 

© = {a = (a iq ) ieVtqea €W Q ;VieV,q€ Q,a iq € [0,1] and = l}. 

9 

We wish to study the fixed-points of g in 0. First, let us note that as is a 
compact state space and as the function g satisfies g : — > and is continuous, 
the existence of a fixed-point of g follows from Brouwer's Theorem. 

We now restrict our attention to a smaller set than the whole state space 0. 
For any e > 0, let 

e £ =|fl£ 0,Vi eV,qe Q,a iq e [e,l- £ ]}. 

Note that we do not claim that g : e — > e . However, the existence of a fixed- 
point of g is ensured in and if we assume a q > for any q € Q (which is a 
reasonable assumption if the number of classes Q is not too large), it can easily 
be seen that any fixed-point satisfies ai q > 0, for any i S V and any g £ Q. 
Thus for sufficiently small e > 0, the fixed-points of g belong to 6 . 

In order to study the behaviour of g in the vicinity of a fixed-point, we need 
to look at some kind of contraction property for g. To this end we introduce a 
distance d on e that will make use of the form of the state space e . For all 
aie e , denote by ai = (ai q ) qe Q £ and bi = (bi q ) q eQ & R9. Moreover, let 

At u\ At u \ i f ma,x qeQ a iq /b iq \ I ' a iq b it 
d(a, o) = max do (a;, 0;) = max log — : — - I — max max log 1 



iev ' iev \mm qeQ a iq /b iq J iev q,eeQ \b iq au 

It is well known that do is a distance in [e, 1 — e]®, and it is easy to check that 
the resulting d is also a distance in e . 

Now, fix a, b G R 2 *^ and consider the distance d(g(a) , g(b)) . It is easily checked 
that 

d(g(a),g(b)) = max d (gi(a),gi(b)) = ma,xd (u t (a), u<(6)) = maxd (ttj(a), u<(6)), 
ieP ieP ie'P 

where u = (u,-)j e 'p = (wjg)ie'p. ( jeg is defined in the following way 

Va=(a iq ) ieViqea em. pQ , 

u iq (a) = cxp ajeLijqi} = exp | ^ ^ + log2A ?£ J |. 



In the following, fix e > and a, 6 g O e and denote by 

Qici i ®i q 

I — , c 2 = max - — 
With these notations, we have 



Vi £P, c\ = .. , , 

q£Q b t „ qeQ bi, 



d(a,b) = maxd"o(aj, 6j) = maxlog ( — J . (33) 
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We only consider the affiliation model described in (5). Thus, there are only 
two different values for X q i, namely Aj n and A out for intra and extra cluster 
connectivity 

Lemma 3. If for any i, j € V \i ^ j and any A € {Xi„, \ ou t\, we have 

< + log 2A < — ^— almost surely, (34) 

A 2(p-l)(l + e) 

then the function g satisfies a contraction property on E . 

Before proving the lemma, let us explain the consequences of this result. 
Consider the function defined on (0, +oo) by 

\K\ 

h K {\) = ^ + log2A. 

This function first decreases from +oo to the value 1 + log 2 1 If | on the interval 
(0, \K\) and then increases from 1 + log 2\K | to +oo on (\K\, +oo). 

At any step of the algorithm, if the current values K\™^ of the concentration 
matrix are small enough, namely smaller than l/(2e) ~ 0.184 then the functions 
h K (m) take all the values between l+log2|if | < and +oo. Thus, there is room 

ij 

for choosing Ai n , A out such that (34) is satisfied. In such a case, the fixed-point 
we are looking for is unique and the iterative procedure setting t^ s+1 - ) = g(r^) 
converges. 

Proof. Using that for any j € V and any i s Q, we have c\bji < a^i < c^bji 
and Lij q i > 0, we get 



exp [ ^2 c i ^2 b 3* L v<ie - u iq( a ) < ex P 1^2 °2 ^2 b 3t L vie 



Thus, it follows 



'(35) 

In the case of the affiliation model, for fixed i,j € V and q 6 Q, the set of 
random variables {Lij q [\^Q is reduced to only two random values, namely 

L f = \JM + i og 2Ain , L°f = J^i + log 2A out . 

-"Mn A ou t 



For the sake of simplicity, we assume Q — 2 groups (our arguments may be 

, . . M it r>fn-i pfleilv V»p eppn tViat ffnr 



easily generalized to 3 groups or more). Now, denoting L™ ax = max(L-", 1/°"*) 
and LJ" ln = min(L^, L° ut ), it can easily be seen that (for e < 1/2), 



^ /,,,/., .., = (1 - e)L" ax + £ L™ in 
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- W) - exp [ " " e)LTr + eL ^ n} 



ma,x qe QU iq (a)/u iq (b) 



almost surely. Note that if we have Q > 3 groups, explicit bounds can also be 
obtained (their expression is only slightly more complicated). Coming back to 
(35), we get 

exp I £(cj - 1){(1 - e)L™« + eL™*} 

This leads to 

d (ui(a),Ui(b)) = lo Q . 

mm qe QUi q {a)/u iq {b) 

< E( c 2 - !){(! - £ ) L ?r + eL " in } - E( c i - w - £ ) L " in + ei " ax } 

< E -l-s(4+c{-2)} + - c{ + e(4 + 4 - 2)}. 

Finally, recall that d(g(a), g(b)) = maxj do(Ui(a), tt,(6)), leading to 

d(ff(o) ,$(&)) < max { (c* - 1 - £ (c 2 + cj - 2)) V (l - c\ + e(c 2 + c\ - 2)) } 

x maxy(^ ax + i™ n ). 

Now, using the inverse triangle inequality, and the fact that < 1 < c|, we get 
for any i £ V, 

|4 + 4-2| = ||4-i| < 14-41 = 4-4- 

Moreover, we have < c\ — 1 < c\ — c\ and < 1 — c \ < c 2 — c\. This leads to 
d(g(a),g(b)) < (1 + e) max(c 2 - cj) x m^^(I- + L™ in ) 

< (1 + e) max(c 2 - c\) x 2(p - 1) maxl™ x . (36) 
Since a and 6 belong to 6 e , we get that c^,c 2 S [^e" 1 ] and thus 



1, c 



-2 



4 - 4 = exp(logc 2 ) - exp(logc l 1 ) < - log 

In particular, recalling (33), we have 

< max Co - c\ < -d(a, b). 

Coming back to (36), we get 

d{g(a),g(b)) < (1 + £- 1 )2( J3 - 1) (maxL- ax )d(a, 6). (37) 

Now, under assumption (34) the multiplicative random factor (1 + e )2{p — 
1) maxj-^i Lf 1 ^ is strictly smaller than 1. □ 
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A. 3. Proof of Lemma 2 ('Lasso with pathwise coordinate 
optimization ) 

The following is partly based on Friedman ct al. (2007). There are various al- 
gorithms for solving the LASSO problem. When there is just one predictor, the 
LASSO solution is simply given by soft-thresholding (Donoho and Johnstone 
1995). The approach used here is based on iterative soft-thresholding with a 
"partial residual" as a response variable. 

The usual formulation of the LASSO problem is the minimization with respect 
to (3 of the quantity 

2 



1 n ( P \ 

i=i \ j=i J 



(38) 



where (yi)»=i n is a vector of response and (xij)i=i n .j—i ... „ a matrix of 
predictors such that Xij = 0, with no loss of generality. Using a coordinate- 
descent approach, we simply write the problem (38) in the form 

i " / V 

»=i \ fe#i / fe^j 

and minimizing this function with respect to f3j will lead to the solution 

where = J2k^j x ikh{p), the normalizing term satisfies N? — J2i=i x % 
and the function S(x,p) = sgn(x)(|x| — p) + is the soft-thresholding operator. 

This leads to an iterative procedure, repeated on each coordinate of (3 until 
stabilization of the full vector. Note that as each coordinate-wise solution is 
unique, results from Tseng (2001, Theorem 4.1) imply that the procedure con- 
verges. 

Now, we want to apply this approach to solve the problem (21), which can 
be written 

2 

+ ||pi2*/%. (39) 



. 1 
mm - 
P 2 



1 ~l/2 

— S u (3 - V2S n S12 



2 

From the previous lines, the solution for jth entry of (3 is 

&(pia) = S X^fty ( £ n /2 si 2 )« - \ Y,(^n\A(Pi2) , ( P i 2 )j 

Then, using the symmetry of the matrices, it is easy to see that 

X^( s u )ii( s n / s i2)i = Z)^( s u s n / )je( s i2)i = (Sl2)i, 

^1/2 ~l/2 ^ 

)ifc/3fc(Pl2) = Efe^( S ll);/*A(Pl2), 
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Finally, the solution to (21) is computed by updating the jth coordinate of (3 
via 

/MPia) = 25 f(s 13 )j - ^(S u ) j)b ^(p 12 );(p 12 ) j j /(Eu)^, 
and permuting the rows of 53 until convergence. 



A. 4- Reconstruction of the concentration matrix 

At the end of the block- wise resolution algorithm, a solution 53 is available. In 
order to recover K, we simply use the fact that 53K = I. Block-wisely, we get 

Kl2 = — 53xi <T 12^22 = —K22P/2, 

K 22 = 1/(ctx2 - ffLj^n £12) = 1/(^12 - ?I 2 3/2), 

thanks to the fact that <Ti2 = 53n/3/2. 

To perform this inversion, note that we need to stock the successive solutions 
f3 of the penalized regressions along the algorithm. 



A. 5. Pseudo-likelihood of a Gaussian vector 

It is well known that the distribution of Xf conditional on the remaining vari- 
ables XF. is Gaussian with parameters (/^,a-j) given by 

Mi = ^i\iE\i\i X \i> a i = E « - s A* s y\i S i\i- ( 40 ) 
Denoting = (uj, . . . , ^™) T , we get 

p p , 

log£(X; X) = -- J^logoi - ]T ^(X, - m l ) T (X J - m.) + c. 

i=l i=l 1 

It is easy to see that mj = 53^537., ^XlA. Then, 



log£(X;J £ :)=-^log ( r. 



\i\i \i 

v 

2 



i=l 



P 1 

- V — (XTXj - 2XTX u 53ri .531 . + 53 iU 53A^ .XT.Xw53r.{ .531 .) + c. 
/ ^ 2(j. ( 1 \ \A 2 A 2 \ \A l \ l \ \A l A 2 

i=i J 

Note that we have n _1 XjX,; = S^, as well as n _1 XjX\j = Sj\j and n~ XLX^j 
S\i\i- Thus, 

log£(X;K) = -^loga, 



2 



"Ei(S« - 2S Ai S \i\i S I\i + S Ai S w S \Ai S w S I\i) + c - (41) 



i=l 
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Recalling that K = E 1 , and by reordering the rows and columns of the matri- 
ces, as well as using a block-wise notation, this becomes 





X 








'1 




l\l 






/ p _i 



where is the identity matrix with size p—1. In particular, this leads to the 
identity E^-K^ = 1 - E^KT^. Thus 

S It = (1 - V Al K] v )/K %1 . (42) 

In the same way, we can easily get that El t Ka — — Ey^jKl t and 

Using identities (42), (43) and (40), we obtain 

en = (1 - E Ai K 4 y/jr« + E^K^/ifrt = 1/Jf«. (44) 

Now, coming back to (41) and using the identities (42), (43) and (44), we finally 
obtain the desired result 

p v { K 1 \ 

log£(X;K)= ^log^-n^ ^^ + S lVi K lVi + — K, V S W KJ V +c 



i=i 



A. 6. Penalization upper bound 

The following lemma states that if the penalization parameters A^ 1 and Aq 1 
are chosen large enough (according to the observations), then the penalized 
estimator obtained from the LASSO-like iteration step has null entries. 

Lemma 4. If for any i,j£V we have 

^Z^Zjt >" lSijl wheni ^j and f>J|<?4 (45) 

, A q t A An A 

q,i y 

then the solution E = K 1 of problem (15) satisfies K 1 = . 
Proof. The sub-gradient equation arising from (15) gives 



Vj 



l/y = 



and Vi £ 7> 



5', 



A, 



fii = 0, 



where i/y G sgn(_ft' i j) and thus G [—1, 1]. In particular, we have 



Vi + j, 



K 



ij 



< 1^1 andWGP, 



< 



An 



Now, if the set of penalty parameters satisfies the constraint (45), then the 
matrix K" 1 = satisfies the sub-gradient equation. Thus, the conclusion comes 
from uniqueness of the solution to (15). 

□ 



